Add gatherf_byte_inds for byte indices from memory #511

Merged 3 commits on Nov 4, 2024

Conversation

@rygorous (Contributor) commented Nov 3, 2024

All the gathers in the codebase pass a vint for indices that has just been initialized from an array of uint8_ts in memory.

This is significant because the NEON/SSE emulation paths have no native gather instruction to begin with, so the first step is to get the indices back to the integer pipe and split them into individual pieces. In this case it is definitely better to just load the indices on the int pipes to begin with; this formulation facilitates that. (It needs to be a template because, unlike the original gatherf, there is no vint argument that implies the vector width for overload resolution.)
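To illustrate the template point, here is a minimal portable sketch. The types `vfloat4_sketch` and `vfloat8_sketch` and the function name `gatherf_byte_inds_sketch` are hypothetical stand-ins, not the library's actual definitions. Both arguments are plain pointers, so the caller has to name the vector type explicitly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for 4-wide and 8-wide float vector types.
struct vfloat4_sketch { std::array<float, 4> lane; };
struct vfloat8_sketch { std::array<float, 8> lane; };

// Sketch only: the vector width comes from the explicit template argument,
// because neither pointer argument carries any width information.
template <typename VecT>
VecT gatherf_byte_inds_sketch(const float* base, const std::uint8_t* indices)
{
    assert(base != nullptr && indices != nullptr);

    VecT result{};
    for (std::size_t i = 0; i < result.lane.size(); ++i)
    {
        // Each byte index is read with an ordinary integer load, so there
        // is no round trip through a vector-of-int register.
        result.lane[i] = base[indices[i]];
    }
    return result;
}
```

A call site would then look like `gatherf_byte_inds_sketch<vfloat4_sketch>(table, idx)`, with the width chosen by the template argument rather than by an index vector.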

Additionally, the gathers in this codebase don't actually make use of predication (the predicates are always all on). That means we have a subset of gather functionality that is fairly easy to emulate manually: indices are readily available on the integer pipes, and no predication, so all we need to do is perform a known number of vector loads and assemble the result.
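As a rough illustration of "a known number of vector loads, then assemble", here is a 4-wide sketch using plain SSE intrinsics with a scalar fallback for other targets. The function name `gather4_manual_sketch` and the `SKETCH_HAVE_SSE` macro are made up for this example, and the load/unpack sequence is one possible formulation, not necessarily what this PR does.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
    #include <xmmintrin.h>
    #define SKETCH_HAVE_SSE 1
#else
    #define SKETCH_HAVE_SSE 0
#endif

// Sketch only: unpredicated 4-wide gather from byte indices in memory.
void gather4_manual_sketch(const float* base, const std::uint8_t* indices, float out[4])
{
    assert(base != nullptr && indices != nullptr && out != nullptr);

#if SKETCH_HAVE_SSE
    // Four scalar loads; each lands in lane 0 with the upper lanes zeroed.
    __m128 a = _mm_load_ss(base + indices[0]);
    __m128 b = _mm_load_ss(base + indices[1]);
    __m128 c = _mm_load_ss(base + indices[2]);
    __m128 d = _mm_load_ss(base + indices[3]);

    // Assemble the result: (a, b, 0, 0) and (c, d, 0, 0), then (a, b, c, d).
    __m128 ab = _mm_unpacklo_ps(a, b);
    __m128 cd = _mm_unpacklo_ps(c, d);
    _mm_storeu_ps(out, _mm_movelh_ps(ab, cd));
#else
    // Portable fallback with the same semantics.
    for (int i = 0; i < 4; ++i)
    {
        out[i] = base[indices[i]];
    }
#endif
}
```

Because every lane is always active, there is no mask to test and the number of loads is fixed at compile time.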

Therefore, provide an option to avoid gather instructions even on AVX2 where they do exist. Gather performance is middling on newer Intel uArchs and outright bad on older (pre-Skylake) P-core Intel uArchs, Intel's E-cores, and AMD's offerings. At least on my home Zen 4, doing the 8 broadcasts + shuffles is much faster than using the native gather instructions, to the tune of a ~13.5% reduction in total coding time.
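A build-time switch along these lines could look like the sketch below. The macro `SKETCH_USE_NATIVE_GATHERS` and the function `gather8_sketch` are hypothetical names chosen for this example, and the broadcast-plus-blend sequence is an assumption about how the manual path might be written, not a copy of the PR's code. The AVX2 branches are only compiled when the compiler targets AVX2; otherwise a scalar loop is used.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical build option: 1 = use the native gather, 0 = avoid it.
#ifndef SKETCH_USE_NATIVE_GATHERS
    #define SKETCH_USE_NATIVE_GATHERS 0
#endif

#if defined(__AVX2__)
    #include <immintrin.h>
#endif

// Sketch only: unpredicated 8-wide gather from byte indices in memory.
void gather8_sketch(const float* base, const std::uint8_t* indices, float out[8])
{
    assert(base != nullptr && indices != nullptr && out != nullptr);

#if defined(__AVX2__) && SKETCH_USE_NATIVE_GATHERS
    // Widen the 8 byte indices to 32-bit lanes, then issue the gather.
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(indices));
    __m256i inds = _mm256_cvtepu8_epi32(bytes);
    _mm256_storeu_ps(out, _mm256_i32gather_ps(base, inds, 4));
#elif defined(__AVX2__)
    // Eight broadcasts merged with blends, kept in two dependency chains:
    // m0 collects the even lanes, m1 collects the odd lanes.
    __m256 m0 = _mm256_broadcast_ss(base + indices[0]);
    __m256 m1 = _mm256_broadcast_ss(base + indices[1]);
    m0 = _mm256_blend_ps(m0, _mm256_broadcast_ss(base + indices[2]), 1 << 2);
    m1 = _mm256_blend_ps(m1, _mm256_broadcast_ss(base + indices[3]), 1 << 3);
    m0 = _mm256_blend_ps(m0, _mm256_broadcast_ss(base + indices[4]), 1 << 4);
    m1 = _mm256_blend_ps(m1, _mm256_broadcast_ss(base + indices[5]), 1 << 5);
    m0 = _mm256_blend_ps(m0, _mm256_broadcast_ss(base + indices[6]), 1 << 6);
    m1 = _mm256_blend_ps(m1, _mm256_broadcast_ss(base + indices[7]), 1 << 7);
    // 0xaa selects the odd lanes from m1 and the even lanes from m0.
    _mm256_storeu_ps(out, _mm256_blend_ps(m0, m1, 0xaa));
#else
    // Non-AVX2 builds: plain scalar loads.
    for (int i = 0; i < 8; ++i)
    {
        out[i] = base[indices[i]];
    }
#endif
}
```

Keeping the choice behind a single macro means the per-uArch decision can be made at build time without touching call sites.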

Test results (compiled with MSVC 2022):

  • On Intel Skylake-X, using the manual gathers is appreciably slower than the native gather instructions. (+6% coding time in my tests)
  • On AMD Zen 2 and Zen 4, avoiding gathers is much faster (as noted above, 13.5% reduction on Zen 4).
  • On Intel Redwood Cove and Intel Crestmont, avoiding gathers comes out around 3-4% faster in my tests depending on the test.

Fabian Giesen and others added 2 commits November 3, 2024 11:20
@solidpixel (Contributor) commented

On my home machine (Intel i5-6500K, CoffeeLake):

  • SSE4.1 - 3-4% faster by avoiding the byte-to-int conversion.
  • NoGather AVX2 - comes in around 6% slower (tested with both Clang 14 and GCC 11)

@solidpixel solidpixel self-requested a review November 4, 2024 21:43
@solidpixel (Contributor) commented Nov 4, 2024

On my laptop (Apple M1):

  • NEON - 2% faster by avoiding the byte-to-int conversion.

@solidpixel solidpixel merged commit 546f9dd into ARM-software:main Nov 4, 2024
7 checks passed
@rygorous rygorous deleted the avoid-gathers branch November 9, 2024 00:22